ABSTRACT

ChEMBL is an Open Data database containing bind- 
ing, functional and ADMET information for a large 
number of drug-like bioactive compounds. These 
data are manually abstracted from the primary pub- 
lished literature on a regular basis, then further 
curated and standardized to maximize their quality 
and utility across a wide range of chemical biology 
and drug-discovery research problems. Currently, 
the database contains 5.4 million bioactivity meas- 
urements for more than 1 million compounds and 
5200 protein targets. Access is available through a 
web-based interface, data downloads and web ser- 
vices at: https://www.ebi.ac.uk/chembldb. 

INTRODUCTION 

A wealth of information on the activity of small molecules 
and biotherapeutics exists in the literature, and access to 
this information can enable many types of drug discovery 
analysis and decision making. For example: selection 
of tool compounds for probing targets or pathways of 
interest; identification of potential off-target activities of 
compounds which may pose safety concerns, explain 
existing side effects or suggest new applications for old 
compounds; analysis of structure-activity relationships 
(SAR) for a compound series of interest; assessment of 
in vivo absorption, distribution, metabolism, excretion 
and toxicity (ADMET) properties; or construction of pre- 
dictive models for use in selection of compounds poten- 
tially active against a new target (1-5). Access to this 
information is especially important due to the continuing 
shift in fundamental research on disease mechanisms from 
the private to public sectors. 



However, bioactivity data published in journal articles 
are usually found in a relatively unstructured format and 
are labour-intensive to search and extract. For example, 
compound structures are frequently depicted only as 
images and are not therefore searchable, protein targets 
may be referred to by a variety of synonyms or abbrevi- 
ations with no reference to any database identifiers, and 
details of assays may be included only in Supplementary 
Data or by reference to previous publications. In addition, 
there is not currently any requirement by most journals for 
authors to deposit small-molecule assay results in public 
databases (as is the case for sequence, protein structure 
and gene expression data). Historically, therefore, the 
majority of the published small-molecule bioactivity data 
have only been readily available via commercial products. 

In recent years, in response to the growing demand for 
open access to this kind of information, a variety of 
public-domain bioactivity resources have been developed. 
PubChem BioAssay (6) and ChemBank (7) are large 
archival databases providing access to millions of de- 
posited screening results, typically from high-throughput 
screening (HTS) experiments. A number of other primary 
resources extract bioactivity data from literature, but tend 
to focus on particular thematic areas, and primarily on 
binding affinity information. For example, BindingDB 
contains quantitative binding constants manually ex- 
tracted from publications, focusing chiefly on proteins that 
are considered to be potential drug targets (8). PDBBind 
(9), Binding MOAD (10) and AffinDB (11) contain binding
affinity information for protein-ligand complexes
found in the Protein Data Bank (PDB, 12). PDSP Ki 
database stores screening data from the National 
Institute of Mental Health's Psychoactive Drug 
Screening Program (13). BRENDA provides binding con- 
stants for enzymes (14), IUPHAR contains ligand infor- 
mation for receptors and ion channels (15), while GLIDA 
(16) and GPCRDB (17) provide information specifically
for G-protein-coupled receptors. Other resources, such 
as DrugBank, provide detailed annotation around 
the properties and mechanism of action of approved 
drugs (18). 

However, in order to make informed decisions in drug 
discovery or to design experiments to probe a biological 
system with chemical tools, it is important to consider not 
only the binding affinity of a compound for its target, but 
also its selectivity, efficacy in functional assays or disease 
models and the likely ADMET properties of the com- 
pound. Moreover, researchers need the ability to intelli- 
gently cluster relevant information across studies (based 
on target or compound similarities, for example) and to 
integrate data across therapeutic areas. ChEMBL aims to 
bridge this gap by providing broad coverage across a 
diverse set of targets, organisms and bioactivity measure- 
ments reported in the scientific literature, together with a 
range of user-friendly search capabilities (19). 



DATA CONTENT 

Data extraction and curation 

The core activity data in the ChEMBL database are 
manually extracted from the full text of peer-reviewed
scientific publications in a variety of journals, such as
Journal of Medicinal Chemistry, Bioorganic & Medicinal
Chemistry Letters and Journal of Natural Products. The set
of journals covered is by no means comprehensive, but is
selected to capture the greatest quantity of high-quality
data in a cost- and time-effective manner. From each publication,
details of the compounds tested, the assays performed and 
any target information for these assays are abstracted. 

Structures for small molecules are drawn in full, in 
machine-readable format, despite the structure often being 
provided as a scaffold and a list of R-group substituents, 
or referred to only by name in the original publication. 
Information about the particular salt form tested is also 
captured, where available, although this is often inconsist- 
ent in the literature. Before loading to the database, struc- 
tures are checked for potential problems (e.g. unusual 
valence on atoms, incorrect structures for common com- 
pounds/drugs), then normalized according to a set of 
rules, to ensure consistency in representation (e.g. com- 
pounds are neutralized by protonating/deprotonating 
acids and bases to ensure a formal charge of zero where 
possible). Preferred representations are used for certain 
common groups (e.g. sugars, sulphoxides and nitroxides). 
Some structural features are typically reported only
implicitly, and these are checked and assigned on
registration. For example, the stereochemistry of the
steroid framework is typically not drawn explicitly in
publications, but is assumed to be that of the naturally
occurring configuration, unless otherwise defined. Common
salts are also stripped from
the extracted compounds, and both the salt form and the 
parent compound are entered into the database. This 
allows users to view all data associated with the same par- 
ent compound, regardless of the salt form tested, while 
still retaining the salt information if required. 
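
The parent-salt relationship described above can be pictured as a simple
lookup from each tested form to its parent compound. The following is a
minimal sketch only: it assumes that salt stripping has already been carried
out by a cheminformatics toolkit, uses compound names as stand-in
identifiers, and the activity values are invented for illustration. The
names PARENT_OF and group_by_parent are hypothetical and are not part of
ChEMBL.

```python
from collections import defaultdict

# Each tested form points at its parent compound; a parent points at itself.
# Names stand in for real identifiers (illustrative assumption).
PARENT_OF = {
    "fingolimod hydrochloride": "fingolimod",
    "fingolimod": "fingolimod",
    "aspirin": "aspirin",
}


def group_by_parent(activities):
    """Group (tested_form, activity_type, value) rows under their parent compound."""
    grouped = defaultdict(list)
    for tested_form, activity_type, value in activities:
        # Forms with no recorded parent are treated as their own parent.
        parent = PARENT_OF.get(tested_form, tested_form)
        grouped[parent].append((tested_form, activity_type, value))
    return dict(grouped)


# Invented example rows: the salt form tested is kept in every row.
rows = [
    ("fingolimod hydrochloride", "IC50", 5.0),
    ("fingolimod", "EC50", 1.0),
    ("aspirin", "IC50", 100.0),
]
by_parent = group_by_parent(rows)
```

Because every row keeps the tested form, the salt information is still
available after grouping.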



Details of all types of assays performed are extracted 
from each publication, including binding assays (measur- 
ing the interaction of the compound with the target 
directly), functional assays (often measuring indirect ef- 
fects of the compound on a pathway, system or whole 
organism) and ADMET assays (measuring pharmaco- 
kinetic properties of the compound, interaction with key 
metabolic enzymes or toxic effects on cells/tissues). The 
activity endpoints measured in these assays are recorded 
with the values and units as given in the paper, but for the 
purposes of improved querying are also standardized, 
where possible, to convert them to a preferred unit of 
measurement for a given activity type (e.g. IC50 values 
are displayed in nM, rather than µM/mM/M, and half-life is
reported in hours rather than minutes/days/weeks). This 
enables the user to more easily compare data across dif- 
ferent assays. 
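
As an illustration of this kind of unit standardization, the sketch below
converts concentration endpoints to nM and half-lives to hours. It is not
the ChEMBL curation code: the set of activity types, the unit spellings and
the function name standardize are assumptions made for the example, and
the published value would in practice be retained alongside the
standardized one.

```python
# Multipliers from a reported concentration unit to nanomolar (nM).
TO_NANOMOLAR = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}

# Multipliers from a reported time unit to hours.
TO_HOURS = {"min": 1 / 60, "h": 1.0, "day": 24.0, "week": 168.0}

# Activity types treated as concentrations in this sketch (assumption).
CONCENTRATION_TYPES = {"IC50", "EC50", "Ki"}


def standardize(activity_type, value, unit):
    """Return (value, unit) in the preferred unit, or unchanged if no rule applies."""
    if activity_type in CONCENTRATION_TYPES and unit in TO_NANOMOLAR:
        return value * TO_NANOMOLAR[unit], "nM"
    if activity_type == "T1/2" and unit in TO_HOURS:
        return value * TO_HOURS[unit], "h"
    # Unknown combinations are passed through untouched rather than guessed at.
    return value, unit
```

For example, an IC50 reported as 2.5 uM becomes 2500 nM, and a half-life
reported as 2 days becomes 48 h.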

To maximize the utility of bioactivity data, the targets 
of assays need to be represented robustly and consistently, 
in a manner independent of the various adopted names 
and synonyms used across different sources. To this end, 
detailed manual annotation of targets is carried out within 
ChEMBL. Where the intended molecular target of an 
assay is reported in a publication, this information is ex- 
tracted, together with associated details of the relevant or- 
ganism in which the assay was performed (or the organism 
from which the protein/cell-line was derived for an in vitro 
assay). Target assignments are carefully checked by our 
curators, and corrected where necessary, then further 
annotated where any ambiguity exists. For example, for 
an in vitro binding assay, it is often possible to determine 
the precise protein target with which the compound is 
interacting and assign a single relevant protein to the 
assay. However, in other cases this may not be possible. 
For example, an assay may describe interaction of a 
compound with a target which is known to be a protein/ 
biomolecular complex (e.g. ribosomes, GABA-A recep- 
tors or integrins). In this case, several protein subunits 
may be assigned to the assay, but a 'complex' field in 
the database is used to record the fact that these 
proteins are associated as a specific protein complex. In 
other cases, the assay performed may not allow elucida- 
tion of the precise protein subtypes with which a 
compound is interacting (e.g. cell/tissue-based assays 
where several closely related subtypes of the protein are 
likely to be expressed, or those reported prior to the dis- 
covery of particular receptor/enzyme subtypes). Again, 
the assay may therefore be mapped to each of the possible 
protein targets, but a 'multi' field in the database records 
the fact that it is not clear whether the compound is inter- 
acting non-specifically with all of these proteins, and con- 
sequently less confidence should be placed in these 
assignments. 
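
The three situations described above (a single protein, a defined complex,
or unresolved subtypes) can be sketched as a small record with two flags.
This is a hypothetical illustration, not the ChEMBL schema: the class name,
field names and the wording returned by describe are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetAssignment:
    assay_id: str
    proteins: tuple            # identifiers of the proteins mapped to the assay
    is_complex: bool = False   # proteins act together as one defined complex
    is_multi: bool = False     # the interacting subtype could not be resolved


def describe(assignment):
    """Summarize how much weight a protein assignment can bear."""
    if assignment.is_multi:
        # Lower confidence: the compound may act on any of the listed proteins.
        return "ambiguous: any of %d related proteins" % len(assignment.proteins)
    if assignment.is_complex:
        return "complex of %d subunits" % len(assignment.proteins)
    if len(assignment.proteins) == 1:
        return "single protein"
    return "unclassified"
```

A query for high-confidence data would then simply exclude assignments
with the 'multi' flag set.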

In many cases, such as whole organism-based pheno- 
typic assays, it is not possible to unambiguously determine 
the protein target that is responsible for the observed 
effect of the compound. In these cases, the assay will be 
mapped to a ChEMBL target representing the non- 
molecular system on which an effect is observed. For 
example, an assay measuring the cytotoxicity of a 
compound against the human breast carcinoma-derived 
MCF-7 cells would be mapped to a ChEMBL cell-line
target representing MCF-7. An in vitro assay measuring 
inhibition of growth of Mycobacterium tuberculosis would 
be mapped to a ChEMBL organism target representing 
M. tuberculosis. This allows users to easily retrieve infor- 
mation about other assays performed on the same systems, 
even though the underlying mechanism of action of the 
compounds might be different. Protein targets are further 
classified into a manually curated family hierarchy, 
according to nomenclature commonly used by drug 
discovery scientists (e.g. ligand-based classification of 
G-protein-coupled receptors, division of enzymes into 
proteases/kinases/phosphatases etc.), and organisms are 
classified according to a simplified subset of the NCBI 
taxonomic structure (20). This also allows data to be 
queried at a higher level (e.g. for all protein kinases or 
Mycobacterium species). 
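
Querying at a higher level of such a hierarchy amounts to walking from each
target's family up towards the root. The sketch below assumes a toy,
much-simplified family tree and hypothetical target labels; it is intended
only to show the principle, not the actual ChEMBL classification.

```python
# Child -> parent links in a toy protein family tree (illustrative assumption).
PARENT = {
    "Protein kinase": "Enzyme",
    "Protease": "Enzyme",
    "Phosphatase": "Enzyme",
    "Tyrosine kinase": "Protein kinase",
}


def is_within(family, ancestor):
    """True if `family` equals `ancestor` or sits anywhere beneath it."""
    while family is not None:
        if family == ancestor:
            return True
        family = PARENT.get(family)  # None once the root is passed
    return False


def targets_in_family(target_families, ancestor):
    """Return, in sorted order, the targets classified under `ancestor`."""
    return sorted(
        target for target, family in target_families.items()
        if is_within(family, ancestor)
    )
```

The same approach applies to the organism classification, with taxonomic
groups in place of protein families.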

Approved drugs 

In addition to literature-derived data, ChEMBL also con- 
tains structures and annotation for Food and Drug 
Administration (FDA)-approved drugs. For each drug 
entry, any information about approved products (from 
the FDA Orange Book, 21) including their trade names, 
administration routes, dosage information and approval 
dates is included in the database. Structures for novel drug 
ingredients are manually assigned, and for protein thera- 
peutics, amino-acid sequences may be included, where 
available. Each drug is also annotated according to the 
drug type (synthetic small molecule, natural product- 
derived small molecule, antibody, protein, oligosacchar- 
ide, oligonucleotide, inorganic etc.), whether there are 
'black box' safety warnings associated with a product 
containing that active ingredient, whether it is a known 
prodrug, the earliest approval date (where known), 
whether it is dosed as a defined single stereoisomer or 
racemic mixture, and whether it has a therapeutic appli- 
cation (as opposed to imaging/diagnostic agents, additives 
etc.). This information allows users of the bioactivity data 
to assess whether a compound of interest is an approved 
drug and is therefore likely to have an advantageous 
safety/pharmacokinetic profile or be orally bioavailable, 
for example. 

Data model 

The most important entity types within ChEMBL are 
documents (from which the data are extracted), com- 
pounds (substances that have been tested for their bio- 
activity), assays (individual experiments that have been 
carried out to assess bioactivity) and targets (the proteins 
or systems being monitored by an assay). Each extracted 
document has a list of associated compound records and 
assays, which are linked together by activities (i.e. the 
actual endpoints measured in the assay with their types, 
values and units). 
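
The relationship between these entity types can be sketched as follows,
with each activity acting as the link between a document, a compound record
and an assay. The class and field names are hypothetical and much simpler
than the real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    document_id: str          # publication the value was extracted from
    compound_record_id: str   # compound as reported in that publication
    assay_id: str             # experiment in which it was measured
    activity_type: str        # e.g. IC50, Ki
    value: float
    units: str


def activities_for_document(activities, document_id):
    """All endpoints extracted from one document."""
    return [a for a in activities if a.document_id == document_id]


def compounds_tested_in_assay(activities, assay_id):
    """Distinct compound records linked to one assay, in sorted order."""
    return sorted({a.compound_record_id for a in activities if a.assay_id == assay_id})
```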

Since the same compound may have been tested multiple 
times in different assays and publications, the compound 
records are collapsed, based on structure, to form a
non-redundant molecule dictionary. The Standard IUPAC
International Chemical Identifier (InChI) representation (22)
is used to determine which compounds are identical and which
should be registered with new identifiers. In general, the 
Standard InChI representation distinguishes stereoisomers 
of a compound, but not tautomers. Hence, stereoisomers 
will be given unique identifiers, but tautomers will not. We 
have taken the view that although a particular binding 
interaction may involve a specific ionization or tautomer 
state, in a biological assay, there will be interconversion 
and equilibration across these forms. A smaller number of 
protein therapeutics and substances with undefined struc- 
tures are also included in the molecule dictionary. 
Additional information is then associated with the 
entries in this table, such as structure representations, 
calculated properties, synonyms, drug information and 
parent-salt relationships. 
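
The registration logic described above can be sketched as a dictionary
keyed on the Standard InChI string: a structure seen before returns its
existing identifier, and a new structure receives the next one. This sketch
assumes that the InChI strings have already been generated by a suitable
toolkit, and the MOL prefix and class name are hypothetical.

```python
class MoleculeDictionary:
    """Non-redundant registry of structures keyed on Standard InChI."""

    def __init__(self):
        self._id_by_inchi = {}
        self._next_number = 1

    def register(self, standard_inchi):
        """Return the identifier for this structure, creating one if it is new."""
        if standard_inchi not in self._id_by_inchi:
            self._id_by_inchi[standard_inchi] = "MOL%d" % self._next_number
            self._next_number += 1
        return self._id_by_inchi[standard_inchi]

    def __len__(self):
        return len(self._id_by_inchi)


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANE = "InChI=1S/CH4/h1H4"

registry = MoleculeDictionary()
first = registry.register(ETHANOL)
second = registry.register(METHANE)
repeat = registry.register(ETHANOL)   # same structure, so same identifier as `first`
```

Because the key is the Standard InChI, two records receive the same
identifier exactly when their Standard InChI strings match, which is what
makes stereoisomers distinct and tautomers equivalent.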

Similarly, a non-redundant target dictionary stores a list 
of the proteins, nucleic acids, subcellular fractions, 
cell-lines, tissues and organisms that are subject to inves- 
tigation. Each assay is then mapped to one or more entries 
in this dictionary, as described above. Further informa- 
tion, such as protein family classification, is also linked 
to the target dictionary. 

Each record in the documents, assays, molecule dictionary 
and target dictionary tables is assigned a unique ChEMBL 
identifier, which takes the form of a 'CHEMBL' prefix 
followed immediately by an integer (e.g. CHEMBL25 is 
the compound aspirin, CHEMBL210 is the human β-2
adrenergic receptor target). In addition, external identi- 
fiers are recorded for these entities where possible. For 
example, all small molecule compounds with defined struc- 
tures are assigned ChEBI identifiers (23) and Standard 
InChIKeys. Where data are taken from other resources,
the original identifiers are also retained (e.g. SIDs and 
AIDs for PubChem substances and assays, HET codes 
for PDBe ligands). PubMed identifiers or Digital Object 
Identifiers (DOIs) are stored for documents (20,24). 
Protein targets are represented by primary accessions with- 
in the UniProt protein database (25), and organism targets 
are assigned NCBI taxonomy IDs and names. 
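
Since the identifier format is simply a 'CHEMBL' prefix followed
immediately by an integer, it can be checked with a short regular
expression. The helper below is a hypothetical convenience function written
for illustration and is not part of any ChEMBL client.

```python
import re

# 'CHEMBL' prefix followed immediately by one or more digits, nothing else.
CHEMBL_ID = re.compile(r"CHEMBL([0-9]+)")


def parse_chembl_id(text):
    """Return the integer part of a ChEMBL identifier, or None if malformed."""
    match = CHEMBL_ID.fullmatch(text.strip())
    return int(match.group(1)) if match else None
```

An identifier containing a stray space or a lower-case prefix is rejected
rather than silently repaired.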

Data exchange 

The PubChem BioAssay database accepts deposited results 
from many laboratories and screening centres and contains 
a large quantity of data, primarily from high-throughput 
screening experiments, measuring inhibition of a target by 
large numbers of compounds, often at a single compound 
concentration. As such, the number of data points within 
PubChem is huge, but a very small proportion of these 
represent compounds with dose-response measurements 
(e.g. IC50, Ki) of an affinity likely to specifically perturb 
a biological system. In contrast, due to extraction from 
published pharmacology and drug discovery literature, 
ChEMBL contains a much larger proportion of active 
compounds identified using dose-response assays. The 
number of distinct protein targets with dose-response 
measurements recorded in PubChem is also smaller (cur- 
rently fewer than 700 proteins, compared with more than 
4000 in ChEMBL). However, there are also novel protein 
targets in PubChem that are not currently included in 
ChEMBL. Therefore, the types of data reported in 
PubChem and ChEMBL are distinct and complementary.
To maximize the utility of the two data sets to users, we
have worked with the PubChem group to develop a data 
exchange mechanism. All ChEMBL literature-derived 
assays are now included in PubChem BioAssay, and a 
subset of PubChem assays (confirmatory and panel assays 
with dose-response endpoints) have been loaded into 
ChEMBL. Assays from PubChem are clearly marked, 
both on the ChEMBL interface and in the database, allow- 
ing users to easily determine where data have originated, 
while benefiting from being able to retrieve more informa- 
tion through a single point of access. 

Similarly, compounds and binding measurements from 
ChEMBL have been integrated into BindingDB, and the 
reciprocal incorporation of BindingDB data into 
ChEMBL is planned. 

Current content 

Release 11 of the ChEMBL database contains information
extracted from more than 42 500 publications,
together with several deposited datasets, and data drawn 
from other databases (Table 1). In total, there are more 
than 1 million distinct compound structures represented 
in the database, with 5.4 million activity values from more 
than 580 000 assays. These assays are mapped to 8200
targets, including 5200 proteins (of which 2388 are 
human). 

DATA ACCESS 

The ChEMBL interface 

The ChEMBL database is accessible via a simple, user- 
friendly interface at: https://www.ebi.ac.uk/chembldb. 
This interface allows users to search for compounds, tar- 
gets or assays of interest in a variety of ways. 

For example, users wishing to retrieve potential tool 
compounds for a target of interest can perform a 
keyword search of the database using a protein name, 
synonym, UniProt accession or ChEMBL target identifier 
of interest. Alternatively, targets can be browsed accord- 
ing to protein family (e.g. to retrieve all chemokine recep- 
tors), or organism (e.g. to retrieve all Plasmodium 
falciparum targets). Since the database only
includes protein targets for which bioactivity data are
available, users can also perform a BLAST search of the 
ChEMBL target dictionary with a protein sequence of 
interest. This can be useful to identify closely related 
proteins with activity data, even if the sequence of 
interest is not represented in the database (e.g. activity 
data for a mouse orthologue of a human target). 

Having retrieved a target, or multiple targets, of interest, 
a simple drop-down menu allows users to display all 
associated bioactivity data, or to filter the available data 
to select activity types of interest (for example to include 
only IC50 and Ki measurements below a given concentra- 
tion threshold, or only certain ADMET endpoints, see 
Supplementary Figure 1). The resulting bioactivity table 
gives details of each compound that was tested (together 
with the particular salt form used in the assay), the 
measured activity type, value and units, a description of 
the assay, details of the target (including the organism) 
and, importantly, a link to the publication from which 
the data have been extracted. Data from this view can 
be exported as a text file or spreadsheet for further
analysis. 
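
Users working with an exported or downloaded copy of the data can apply the
same kind of filter themselves. The sketch below keeps only IC50 and Ki
values below a concentration threshold and writes the result as
tab-delimited text. The column names and compound labels are assumptions
made for the example and do not reflect the actual export format.

```python
import csv
import io

FIELDS = ["compound", "type", "value", "units"]


def filter_activities(rows, types=("IC50", "Ki"), max_nm=100.0):
    """Keep rows of the requested types whose value in nM is below max_nm."""
    return [
        row for row in rows
        if row["type"] in types and row["units"] == "nM" and row["value"] < max_nm
    ]


def to_tsv(rows):
    """Serialize rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Restricting the comparison to rows already expressed in nM avoids mixing
units when the threshold is applied.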

Alternatively, users may have a particular compound of 
interest and wish to retrieve potency, selectivity or 
ADMET information for this, or closely related com- 
pounds. Again, users can search for compounds using a 
keyword search with names/synonyms or ChEMBL iden- 
tifiers. However, a more effective strategy will often be to 
search by compound structure. The interface provides a 
choice of several different drawing tools (26), allowing 
users to sketch in a structure or substructure of interest 
(Figure 1). A compound similarity or substructure search 
of the database (implemented using the Accelrys Direct 
Oracle Cartridge:
http://accelrys.com/products/informatics/cheminformatics/accelrys-direct.html) can then be
carried out to retrieve ChEMBL compounds similar to, 
or containing, the input structure. 

Table 1. Sources of compound and bioactivity data in ChEMBL_11

Source                                                    Number of   Number of  Number of  Number of  Number of  Number of
                                                          compound    assays     activity   targets    protein    organisms
                                                          structures             results               targets

ChEMBL literature extraction                              629 943     580 624    3 282 945  7957       5104       1552
PubChem BioAssay^a                                        364 203     1636       2 079 974  681        647        63
GSK TCAMS Malaria Data (32)                               13 467      6          81 198     3          0          2
PDBe Ligands                                              12 337      0          0          0          0          0
Novartis-GNF Malaria Data (33)                            5675        4          22 788     3          0          2
St Jude Children's Hospital Malaria Data^b (34)           1524        16         5456       8          0          5
Guide to Receptors and Channels (35)                      560         344        801        239        239        6
Sanger Institute Genomics of Drug Sensitivity in Cancer   17          352        5984       352        0          1

^a PubChem BioAssay set includes only confirmatory/panel assays from PubChem that have dose-response end points.

^b Only compounds with dose-response measurements from the St Jude malaria screening data set have been incorporated into ChEMBL, but the full
high-throughput screening data can be downloaded from the ChEMBL-NTD website: https://www.ebi.ac.uk/chemblntd.

Figure 1. Retrieving bioactivity data with a substructure search. A choice of sketchers allows the user to enter a structure of interest and search the
database for compounds similar to, or containing that substructure (a). The resulting list of compounds can then be filtered graphically, according to
their physicochemical properties (e.g. calculated lipophilicity AlogP and molecular weight) using the sliders and 'update chart' button (b). When a
suitable compound set has been created, a drop-down menu allows the user to retrieve all relevant bioactivity results from the database, or filter the
results further by activity type (c).

Figure 2. Compound report card for Fingolimod (CHEMBL314854) showing synonyms, approved drug features (see Supplementary Figure 2),
a link to retrieve clinical trial data, calculated compound properties and structure representations, and different salt forms of the molecule (in this
case, a hydrochloride salt). The lower portion of the page has a series of clickable widgets, showing breakdown of the activity data for this
compound by activity type (e.g. IC50, EC50), assay type (e.g. binding/functional/ADMET) or target type (e.g. enzyme, receptor). Clicking on a
portion of one of the pie charts takes the user directly to the relevant bioactivity results.

Having retrieved a list of compounds of interest, a
variety of calculated properties such as molecular
weight, calculated lipophilicity (AlogP, 27) and polar
surface area (28) can be viewed and filtered via a graphical
display. This may be useful to restrict the set of compounds
to those that are likely to have appropriate drug-like
properties (29), before retrieving or filtering the
associated bioactivity data.
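
The same kind of property-based restriction can be applied to a downloaded
compound set. The sketch below filters compounds by user-supplied ranges of
calculated properties. The property names and compound values are
hypothetical, and the example cut-offs (molecular weight up to 500 and
calculated logP up to 5) are the widely used rule-of-five limits, chosen
here only for illustration and not necessarily those used by the interface.

```python
def within_ranges(properties, ranges):
    """True if every property named in `ranges` lies inside its (low, high) bounds."""
    return all(
        low <= properties[name] <= high
        for name, (low, high) in ranges.items()
    )


def filter_compounds(compounds, ranges):
    """Keep only the compounds whose calculated properties satisfy all ranges."""
    return [compound for compound in compounds if within_ranges(compound, ranges)]


# Illustrative cut-offs; adjust to the question being asked.
EXAMPLE_RANGES = {"mol_weight": (0.0, 500.0), "alogp": (-5.0, 5.0)}
```

Passing the ranges in as data, rather than hard-coding them, mirrors the
adjustable sliders of the graphical display.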

For each of the main data types in ChEMBL (compounds,
targets, assays and documents), report card pages are
available. These provide further details about the entity
of interest, such as names and synonyms (for targets and
compounds), journal/abstract details (for documents), drug
annotation, structures and calculated physicochemical
properties (for compounds), together with cross-references
to other resources (e.g. UniProt, PDBe, ChEBI, DrugBank and
CiteXplore: http://www.ebi.ac.uk/citexplore). Each report
card also contains a series of clickable graphical 'widgets'
summarizing and providing rapid access to all of the
bioactivity data available for that entity (Figure 2).

A table view of approved drugs is also provided, with 
relevant annotation (e.g. drug type, administration route, 
'black box' safety warnings) indicated by a series of 
sortable icons (see Supplementary Figure 2). Users can 
download the structures for these drugs or go to report 
cards to access further information, such as bioactivity 
data. 

Downloads and web services 

While the ChEMBL interface provides the functionality 
required for many common use-cases, some users may 
prefer to download the database and query it locally (for 
use in large-scale data mining, to integrate with their own 
proprietary data, or due to data security policies around 
the use of chemical structures at their institutions, for 
example). Each release of ChEMBL is freely available 
from our ftp site in a variety of formats, including 
Oracle, MySQL, an SD file of compound structures and 
a FASTA file of the target sequences, under a Creative 
Commons Attribution-ShareAlike 3.0 Unported license 
(http://creativecommons.org/licenses/by-sa/3.0).
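
As an example of working with the downloaded files locally, the target
sequences in the FASTA file can be read with a few lines of code. The
sketch below relies only on the general FASTA convention (a header line
beginning with '>' followed by one or more sequence lines); the layout of
the header text in the ChEMBL file is not assumed, so each header is
returned as an opaque string.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record, which has no following header to trigger it.
    if header is not None:
        yield header, "".join(chunks)
```

An open file handle can be passed directly as `lines`, so large files are
processed one record at a time without being loaded into memory.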

In addition, a set of RESTful web services is provided 
(together with sample Java, Perl and Python clients), to 
allow programmatic retrieval of ChEMBL data in XML 
or JSON formats (see https://www.ebi.ac.uk/chembldb/ws 
for more details). 

Finally, to allow greater interoperability of the ChEMBL 
data with molecular interaction and pathway data (e.g. for 
annotation of pathways with chemical tools), a subset of 
the database (compounds active in binding assays against 
protein targets) is available in PSI-MITAB 2.5 format (30) 
via PSICQUIC web services (31). 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 